CFAES Bioinformatics Core, Ohio State University
2026-01-29
Determining the sequence of DNA, RNA, or protein fragments.
Most commonly, especially in “high-throughput” sequencing, it refers to DNA sequencing specifically. This week and next, we will focus on DNA sequencing only, keeping in mind that:
RNA can be, and usually is, sequenced via DNA sequencing
How is that done and why?
RNA is usually reverse transcribed to DNA (cDNA) prior to sequencing.
While it is becoming more feasible to directly sequence RNA molecules, RNA is an unstable molecule that is easily degraded and harder to sequence.
High-throughput sequencing (HTS)
Sequences 10^5 to 10^9 DNA fragments at a time, usually randomly selected. There are two types:
Short-read HTS: More accurate, shorter reads (since 2005)
Long-read HTS: Less accurate, longer reads (since 2011)
These sequenced fragments of DNA are usually called reads
A, C, G, T)
The entire human genome was sequenced with Sanger technology!
How many basepairs is that? Want to guess how much this cost?
https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost
Image generated by Adobe Firefly
Some present-day uses of Sanger sequencing include:
Taxonomic identification of samples
Examining variation among individuals or populations in one or a few candidate or marker genes
Let’s start with the big picture – HTS data underlies several of these main “omics” approaches:
Copyright ThermoFisher
The “omics” suffix indicates the involvement of large-scale datasets — in the sense that, for example, “genomics” data typically spans much or all of the genome.
While the boundaries can be fuzzy, sequencing a single gene in a single organism is not genomics, and running qPCR for a handful of genes is not transcriptomics.

| Omics type | Molecule type | Data mainly produced by |
|---|---|---|
| Genomics | DNA | High-throughput sequencing (HTS) |
| Epigenomics | DNA modifications | High-throughput sequencing (HTS) |
| Transcriptomics | RNA | High-throughput sequencing (HTS) |
| Proteomics | Proteins | Mass Spectrometry |
| Metabolomics | Metabolites | Mass Spectrometry |
TBA
| | Short-read HTS | Long-read HTS |
|---|---|---|
| Usage | More | Less (but increasing) |
| Main companies | Illumina | Oxford Nanopore Technologies (ONT) & Pacific Biosciences (PacBio) |
| Timeline | Since 2005 — technology fairly stable | Since 2011 — still rapid development |
| Read lengths | 50-300 bp | 10-100+ kbp |
| Error rates | Mostly <0.1% | 1-10% (ONT) / <0.1-10% (PacBio) |
| Throughput | Higher | Lower |
| Cost per base | Lower | Higher |
For example:
Here, genomic locations (variant analysis) or gene identities (RNA-Seq) can be reliably inferred from as little as 25 bp.
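A rough way to see why such short reads can still be placed is to compare the number of possible sequences of a given length with the size of a genome. The sketch below assumes a genome of about 3 billion bp (roughly human-sized) and treats the genome as random sequence, which real genomes are not: repeats make many short sequences non-unique, so this is an optimistic illustration only.

```python
def possible_sequences(k: int) -> int:
    """Number of distinct DNA sequences of length k (4 bases per position)."""
    return 4 ** k

# Assumption for illustration: a genome of ~3 billion bp.
GENOME_SIZE = 3_000_000_000

for k in (10, 16, 25):
    n = possible_sequences(k)
    # A ratio far above 1 means a random k-mer is unlikely to occur
    # by chance at more than one position in the genome.
    print(f"k={k}: {n:,} possible sequences, ratio to genome size = {n / GENOME_SIZE:.2e}")
```

At 25 bp there are vastly more possible sequences than positions in the genome, whereas at 10 bp there are far fewer.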
A read’s sequence may differ from the actual DNA sequence it came from:
When you receive HTS reads, base calls have typically been made already.
Every base call is accompanied by a quality score, representing the estimated error probability.
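The relationship between a quality score and its error probability can be sketched as follows. This assumes the widely used Phred scale, Q = -10 * log10(p), and the "Phred+33" ASCII encoding common in FASTQ files; neither is named in the text above, so treat both as assumptions.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Error probability for a Phred quality score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred quality score for an error probability: Q = -10 * log10(p)."""
    return -10 * math.log10(p)

def decode_quality_string(qual: str, offset: int = 33) -> list[int]:
    """Decode an ASCII-encoded quality string (assumed Phred+33) into scores."""
    return [ord(char) - offset for char in qual]

# Hypothetical quality string for a 3-base read
scores = decode_quality_string("I5!")
probs = [phred_to_error_prob(q) for q in scores]
```

Under these assumptions, a score of 20 corresponds to an estimated 1-in-100 chance that the base call is wrong, and a score of 30 to 1 in 1,000.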
To overcome sequencing errors, every base can be sequenced multiple times, i.e., obtaining a “depth of coverage” greater than 1:
Typical depths of coverage are ~50-100x for genome assembly and 10-30x for “resequencing” (!)
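As a back-of-the-envelope sketch, mean depth of coverage can be estimated as the total number of sequenced bases divided by the genome size. The numbers below (150 bp reads, a 3 Gbp genome) are illustrative assumptions, and the calculation ignores uneven coverage and reads that are later discarded.

```python
import math

def mean_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Mean depth of coverage = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size

def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target mean depth."""
    return math.ceil(target_depth * genome_size / read_length)

# Illustrative assumptions: 150 bp reads, 3 Gbp genome, 30x target depth
n = reads_needed(target_depth=30, read_length=150, genome_size=3_000_000_000)
depth = mean_depth(n, read_length=150, genome_size=3_000_000_000)
```

With these assumed values, reaching 30x requires 600 million reads.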
Multiplexing!
Adapters can include “indices” or “barcodes” to identify individual samples, so many samples can be combined (multiplexed) into a single library.
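Conceptually, separating multiplexed reads back into samples ("demultiplexing") is a lookup from index sequence to sample name. The sketch below is a simplified illustration with made-up barcodes and sample names, and requires exact matches; real demultiplexing software typically also tolerates mismatches and handles dual indices.

```python
from collections import defaultdict

def demultiplex(reads, barcode_to_sample):
    """Group reads by sample using an exact match on the index sequence.

    reads: iterable of (index_sequence, read_sequence) pairs.
    barcode_to_sample: dict mapping index sequences to sample names.
    Reads with an unknown index are collected under "undetermined".
    """
    by_sample = defaultdict(list)
    for index_seq, read_seq in reads:
        sample = barcode_to_sample.get(index_seq, "undetermined")
        by_sample[sample].append(read_seq)
    return dict(by_sample)

# Hypothetical barcodes and reads, for illustration only
barcodes = {"ACGTAC": "sample_A", "TGCATG": "sample_B"}
reads = [
    ("ACGTAC", "GGGTTTAAA"),
    ("TGCATG", "CCCAAATTT"),
    ("ACGTAC", "TTTGGGCCC"),
    ("NNNNNN", "AAAAAAAAA"),  # index could not be read
]
result = demultiplex(reads, barcodes)
```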
“Adapter read-through”: the final bases in the resulting reads will consist of adapter sequence, which should be removed before downstream analysis
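A minimal sketch of what removing read-through adapter sequence involves: find the adapter (or the start of it) at the 3' end of a read and cut the read there. The function name, the `min_overlap` parameter, and the example sequences are assumptions for illustration; this version only handles exact matches, whereas dedicated trimming tools also allow for sequencing errors.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove adapter sequence from the 3' end of a read (exact matches only)."""
    # Case 1: the complete adapter occurs somewhere in the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends partway through the adapter, so only a prefix
    # of the adapter is present. Try the longest possible overlap first.
    max_overlap = min(len(adapter) - 1, len(read))
    for overlap in range(max_overlap, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    # No adapter detected: return the read unchanged.
    return read

# Illustrative adapter sequence and reads
ADAPTER = "AGATCGGAAGAGC"
full = trim_adapter("ACGT" + ADAPTER + "TTT", ADAPTER)
partial = trim_adapter("ACGTACGTTT" + "AGATCGG", ADAPTER)
clean = trim_adapter("ACGTACGT", ADAPTER)
```

The `min_overlap` threshold is a trade-off: a lower value removes more genuine adapter remnants but also trims more read ends that match the adapter start by chance.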
Overlapping reads (this can be useful!):
Sequencing is performed by synthesizing a new strand using fluorescently-labeled bases and taking a picture each time a new nucleotide is incorporated:
Many HTS applications either require a “reference genome” or involve its production.
What exactly does reference genome refer to? It usually includes:
An assembly
A representation of most or all of the genome DNA sequence: the genome assembly
An annotation
Provides e.g. locations of genes and other genomic “features” in the corresponding genome assembly, and functional information for these features
Taxonomic identity
Reference genomes are typically applicable at the species level. For example, if you work with maize, you want a Zea mays reference genome. But:
https://en.wikipedia.org/wiki/Genome_size
Konkel and Slot (2023)
With increasing usage & quality of long-read HTS, assemblies are getting better and better
Chromosome-level assemblies require additional technologies (e.g., Hi-C)
Many assemblies instead consist of (often 1000s of) fragments (contigs and scaffolds)
How is this data stored?
Both genome assemblies and annotations are typically saved in a single text file each — we’ll explore some of these files in tomorrow’s lab.
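To make the idea of an assembly stored as a text file concrete, here is a small sketch that summarizes the number and lengths of sequences in such a file. It assumes the FASTA format (a `>` header line per sequence, followed by sequence lines), which is a common choice for assemblies but is not named in the text above; the sequence names and contents are made up.

```python
def sequence_lengths(fasta_text: str) -> dict[str, int]:
    """Return the length of each sequence in FASTA-formatted text."""
    lengths: dict[str, int] = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the sequence name is the first word after ">"
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            # Sequence line: may be wrapped over multiple lines
            lengths[name] += len(line)
    return lengths

# Toy "assembly" with two hypothetical contigs
toy_assembly = """>contig_1 example
ACGTACGTAC
GTACGT
>contig_2
TTTTGGGG
"""
lengths = sequence_lengths(toy_assembly)
n_sequences = len(lengths)
total_bp = sum(lengths.values())
```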
You’ve learned:
That high-throughput sequencing (HTS) enables large-scale DNA sequencing
How short-read and long-read HTS have different strengths and weaknesses
About libraries and the technology underlying short-read sequencing
The labs this and next week are organized around the data set from Garrigós et al. (2025):
This paper uses paired-end Illumina RNA-Seq data to study gene expression in Culex pipiens mosquitos infected with two different malaria-causing Plasmodium protozoans.